Abstract

We generalise the classical Pinsker inequality, which relates variational
divergence to Kullback-Leibler divergence, in two ways: we consider arbitrary
$f$-divergences in place of KL divergence, and we assume knowledge of a
sequence of values of generalised variational divergences. We then develop a
best possible inequality for this doubly generalised situation. Specialising
our result to the classical case provides a new and tight explicit bound
relating KL to variational divergence (solving a problem posed by Vajda some
40 years ago). The solution relies on exploiting a connection between
divergences and the Bayes risk of a learning problem via an integral
representation.

1 Introduction 

Divergences such as the Kullback-Leibler and variational divergence arise
pervasively. They are a means of defining a notion of distance between two
probability distributions. The question often arises: given knowledge of one,
what can be said of the other? For all distributions $P$ and $Q$ on an
arbitrary set, the classical Pinsker inequality relates the Kullback-Leibler
divergence $\mathrm{KL}(P,Q)$ and variational divergence $V(P,Q)$ by
$\mathrm{KL}(P,Q) \ge \frac{1}{2}[V(P,Q)]^2$. This simple classical bound is
known not to be tight. Over the past several decades a number of refinements
have been given (see Appendix A for a summary of past work).

Vajda [31] posed the question of determining a tight lower bound on
KL-divergence in terms of variational divergence. This "best possible Pinsker
inequality" takes the form

$$L(V) := \inf_{V(P,Q)=V} \mathrm{KL}(P,Q), \qquad V \in [0,2). \tag{1}$$

Recently Fedotov et al. [7] presented an implicit parametric solution of the
form of the graph of the bound as $(V(t), L(t))_{t \in \mathbb{R}^+}$ where

$$V(t) = t\left(1 - \left(\coth(t) - \frac{1}{t}\right)^2\right), \qquad
L(t) = \log\left(\frac{t}{\sinh(t)}\right) + t\coth(t) - \frac{t^2}{\sinh^2(t)}. \tag{2}$$
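The parametric form (2) is easy to tabulate. The following is a minimal sketch, assuming natural logarithms; the function name `fedotov_point` is hypothetical and not from the paper. It prints points on the curve next to the classical Pinsker value $V^2/2$ for comparison.

```python
import math


def fedotov_point(t):
    """Return (V(t), L(t)) on the parametric curve (2), for t > 0."""
    coth = math.cosh(t) / math.sinh(t)
    v = t * (1.0 - (coth - 1.0 / t) ** 2)
    bound = math.log(t / math.sinh(t)) + t * coth - (t / math.sinh(t)) ** 2
    return v, bound


for t in (0.1, 0.5, 1.0, 2.0, 5.0):
    v, bound = fedotov_point(t)
    # The last column is the classical Pinsker bound V^2 / 2.
    print(f"t={t:4.1f}  V={v:.6f}  L={bound:.6f}  V^2/2={v * v / 2:.6f}")
```

Since the curve only gives $L$ as a function of the parameter $t$, evaluating the bound at a prescribed $V$ requires a one dimensional search over $t$.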

One can generalise the notion of a Pinsker inequality in at least two ways: 1)
replace KL divergence by a general $f$-divergence; and 2) bound the
$f$-divergence in terms of the known values of a sequence of generalised
variational divergences (defined later in this paper)
$(V_{\pi_i}(P,Q))_{i=1}^n$, $\pi_i \in (0,1)$. In this paper we study this
doubly generalised problem and provide a complete solution in terms of
explicit, best possible bounds.

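The main result is given below as Theorem 6. Applying it to specific
$f$-divergences gives the following corollary.¹

Corollary 1 Let $V = V(P,Q)$ denote the variational divergence between the
distributions $P$ and $Q$ and similarly for the other divergences in Table 1
below. Then the following bounds for the divergences hold and are tight:

$$h^2 \ge 2 - \sqrt{4 - V^2}, \qquad J \ge V \ln\left(\frac{2+V}{2-V}\right),$$

$$I \ge \left(\frac{1}{2} - \frac{V}{4}\right)\ln(2 - V) + \left(\frac{1}{2} + \frac{V}{4}\right)\ln(2+V) - \ln(2),$$

$$T \ge \ln\left(\frac{4}{\sqrt{4 - V^2}}\right) - \ln(2),$$

$$\chi^2 \ge [\![V < 1]\!]\,V^2 + [\![V \ge 1]\!]\,\frac{V}{2 - V}, \tag{3}$$

$$\mathrm{KL} \ge \min_{\beta \in [V-2,\,2-V]} \left(\frac{V+2-\beta}{4}\right)\ln\left(\frac{\beta-2-V}{\beta-2+V}\right) + \left(\frac{\beta+2-V}{4}\right)\ln\left(\frac{\beta+2-V}{\beta+2+V}\right). \tag{4}$$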
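As a quick sanity check of the symmetric bounds, one can evaluate them on a two-point pair of distributions. The sketch below assumes discrete $P$ and $Q$ given as probability lists, and uses $f(t) = (\sqrt{t}-1)^2$ for $h^2$ and $f(t) = (t-1)\ln t$ for $J$, as in the derivations later in the paper; all function names are hypothetical. On the symmetric two-point pair shown, the Hellinger and Jeffreys bounds are attained with equality.

```python
import math


def f_divergence(f, p, q):
    """I_f(P,Q) = sum_x q(x) f(p(x)/q(x)) for discrete P, Q with q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def f_variational(t):
    return abs(t - 1.0)


def f_hellinger(t):
    return (math.sqrt(t) - 1.0) ** 2


def f_jeffreys(t):
    return (t - 1.0) * math.log(t)


def hellinger_bound(v):
    return 2.0 - math.sqrt(4.0 - v * v)


def jeffreys_bound(v):
    return v * math.log((2.0 + v) / (2.0 - v))


# Hypothetical two-point pair, chosen so that V(P,Q) = v exactly.
v = 0.8
p = [0.5 + v / 4, 0.5 - v / 4]
q = [0.5 - v / 4, 0.5 + v / 4]
print(f_divergence(f_variational, p, q), v)
print(f_divergence(f_hellinger, p, q), hellinger_bound(v))
print(f_divergence(f_jeffreys, p, q), jeffreys_bound(v))
```

For a generic pair the divergences lie above the bounds rather than on them.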

The proof of the main result depends in an essential way on a learning
theory perspective. We make use of an integral representation of
$f$-divergences in terms of DeGroot's statistical information, the difference
between a prior and posterior Bayes risk [4]. By using the relationships
between the generalised variational divergence and the 0-1 misclassification
loss we are able to use an elementary but somewhat intricate geometrical
argument to obtain the result.

The rest of the paper is organised as follows. Section 2 collects background
results upon which we rely. The main result of the paper is stated in
Section 3 and its proof presented in Section 4. Appendix A summarises
previous work.

¹ The terms $[\![V < 1]\!]$ and $[\![V \ge 1]\!]$ in (3) are indicator functions and are defined below.
2 Background Results and Notation

In this section we collect notation and background concepts and results we need
for the main result.



2.1 Notational Conventions 

The substantive objects are defined within the body of the paper. Here we
collect elementary notation and the conventions we adopt throughout. We write
$x \wedge y := \min(x,y)$, $x \vee y := \max(x,y)$ and $[\![p]\!] = 1$ if $p$
is true and $[\![p]\!] = 0$ otherwise. The generalised function
$\delta(\cdot)$ is defined by $\int_a^b \delta(x) f(x)\,dx = f(0)$ when $f$ is
continuous at $0$ and $a < 0 < b$. For convenience, we will define
$\delta_c(x) := \delta(x - c)$. The real numbers are denoted $\mathbb{R}$, the
non-negative reals $\mathbb{R}^+$. Sets are in calligraphic font:
$\mathcal{X}$. Vectors are written in bold font:
$\mathbf{a}, \boldsymbol{\alpha}, \mathbf{x} \in \mathbb{R}^m$. We will often
have cause to take expectations ($\mathbb{E}$) over random variables. We write
such quantities in blackboard bold: $\mathbb{I}$, $\mathbb{L}$, etc. The lower
bound on quantities with an intrinsic lower bound (e.g. the Bayes optimal
loss) is written with an underbar: $\underline{L}$, $\underline{\mathbb{L}}$.
Quantities related by double integration recur in this paper and we notate the
starting point in lower case, the first integral with upper case, and the
second integral in upper case with an overbar: $\gamma$, $\Gamma$,
$\bar\Gamma$.



2.2 Csiszár $f$-divergences

The class of $f$-divergences [1113] provide a rich set of relations that can
be used to measure the separation of the distributions. An $f$-divergence is a
function that measures the "distance" between a pair of distributions $P$ and
$Q$ defined over a space $\mathcal{X}$ of observations. Traditionally, the
$f$-divergence of $P$ from $Q$ is defined for any convex
$f : (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$. In this case, the
$f$-divergence is

$$\mathbb{I}_f(P,Q) = \mathbb{E}_Q\left[f\left(\frac{dP}{dQ}\right)\right] = \int_{\mathcal{X}} f\left(\frac{dP}{dQ}\right) dQ \tag{5}$$

when $P$ is absolutely continuous with respect to $Q$ and equals $\infty$
otherwise.²

All $f$-divergences are non-negative and zero when $P = Q$, that is,
$\mathbb{I}_f(P,Q) \ge 0$ and $\mathbb{I}_f(P,P) = 0$ for all distributions
$P$, $Q$. In general, however, they are not metrics, since they are not
necessarily symmetric (i.e., $\mathbb{I}_f(P,Q) = \mathbb{I}_f(Q,P)$ need not
hold for all distributions $P$ and $Q$) and do not necessarily satisfy the
triangle inequality.

Several well-known divergences correspond to specific choices of the function
$f$ [D §5]. One divergence central to this paper is the variational divergence
$V(P,Q)$ which is obtained by setting $f(t) = |t - 1|$ in Equation 5. It is
the only $f$-divergence that is a true metric on the space of distributions
over $\mathcal{X}$ [13] and gets its name from its equivalent definition in
the variational form

$$V(P,Q) = 2\|P - Q\|_\infty := 2 \sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|. \tag{6}$$



² Liese and Miescke [18, pg. 34] give a definition that does not require absolute continuity.

(Some authors define $V$ without the 2 above.) Furthermore, the variational
divergence is one of a family of "primitive" or "simple" $f$-divergences
discussed in Section 2.3. These are primitive in the sense that all other
$f$-divergences can be expressed as a weighted sum of members from this
family.

Another well known $f$-divergence is the Kullback-Leibler (KL) divergence
$\mathrm{KL}(P,Q)$, obtained by setting $f(t) = t\ln(t)$ in Equation 5. Others
are given in Table 1.

As already mentioned in the introduction, the KL and variational divergences
satisfy the classical Pinsker's inequality which states that for all
distributions $P$ and $Q$ over some common space $\mathcal{X}$

$$\mathrm{KL}(P,Q) \ge \frac{1}{2}[V(P,Q)]^2. \tag{7}$$
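For distributions on a finite set, both forms of the variational divergence and the classical inequality (7) can be checked directly. The following is a minimal sketch, assuming $P$ and $Q$ are given as probability lists over the same finite set; the function names and the example values are hypothetical.

```python
import math
from itertools import combinations


def variational(p, q):
    """V(P,Q) = sum_x |p(x) - q(x)|, the f-divergence with f(t) = |t - 1|."""
    return sum(abs(px - qx) for px, qx in zip(p, q))


def variational_sup(p, q):
    """V(P,Q) = 2 sup_A |P(A) - Q(A)| as in (6), by brute force over all subsets A."""
    indices = range(len(p))
    best = 0.0
    for size in range(len(p) + 1):
        for subset in combinations(indices, size):
            gap = abs(sum(p[i] for i in subset) - sum(q[i] for i in subset))
            best = max(best, gap)
    return 2.0 * best


def kl(p, q):
    """KL(P,Q) = sum_x p(x) ln(p(x)/q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
v = variational(p, q)
print("V (sum form):", v)
print("V (sup form):", variational_sup(p, q))
print("KL:", kl(p, q), " Pinsker lower bound:", 0.5 * v * v)
```

The brute-force supremum is exponential in the size of the set and is only meant to illustrate that the two definitions agree.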

2.3 Integral Representations of $f$-divergences

The main tool in our proof of Theorem 6 is the integral representation of
$f$-divergences, first articulated by Österreicher and Vajda [20] and
Gutenbrunner [T^ . They show that an $f$-divergence can be represented as a
weighted integral of the "simple" divergence measures

$$V_\pi(P,Q) := \mathbb{I}_{f_\pi}(P,Q), \tag{8}$$

where $f_\pi(t) := \min\{\pi, 1-\pi\} - \min\{1-\pi, \pi t\}$ for
$\pi \in [0,1]$.

Theorem 2 For any convex $f$ such that $f(1) = 0$, the $f$-divergence
$\mathbb{I}_f$ can be expressed, for all distributions $P$ and $Q$, as

$$\mathbb{I}_f(P,Q) = \int_0^1 V_\pi(P,Q)\,\gamma_f(\pi)\,d\pi \tag{9}$$

where the (generalised) function

$$\gamma_f(\pi) := \frac{1}{\pi^3} f''\left(\frac{1-\pi}{\pi}\right). \tag{10}$$

Recently, this theorem has been shown to be a direct consequence of a
generalised Taylor's expansion for convex functions [TTJIll].

Even when $f$ is not twice differentiable, the convexity of $f$ implies its
continuity and so its right-hand derivative $f'_+$ exists. In this case,
$\gamma_f$ is interpreted distributionally in terms of $df'_+$. For example,
when $f(t) = |t - 1|$ then $f''(t) = 2\delta(t - 1)$ and so
$\gamma_f(\pi) = \frac{2}{\pi^3}\delta\left(\frac{1-2\pi}{\pi}\right) = 4\delta_{1/2}(\pi)$.

The divergences $V_\pi$ for $\pi \in [0,1]$ can be seen as a family of
generalised variational divergences since $df'_+(t)$ for any member of this
family is $\pi\,\delta\left(t - \frac{1-\pi}{\pi}\right)$ and so
$\gamma_{f_\pi} = \delta_\pi$. Thus, for $\pi = \frac12$ we have
$\gamma_{f_{1/2}} = \delta_{1/2}$, that is, one quarter of the $\gamma$
function for variational divergence, and so by (9) we see that

$$V(P,Q) = 4V_{1/2}(P,Q). \tag{11}$$
Theorem 2 shows that knowledge of the values of $V_\pi(P,Q)$ for all
$\pi \in [0,1]$ is sufficient to compute the value of $\mathbb{I}_f(P,Q)$ for
any $f$-divergence, since the weight function $\gamma_f$ is dependent only on
$f$, not $P$ and $Q$. All of the generalised Pinsker bounds we derive are
found by asking how knowledge of the value of a finite number of
$V_{\pi_i}(P,Q)$ constrains the overall value of $\mathbb{I}_f(P,Q)$.
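The representation (9) can be checked numerically for a simple case. The sketch below assumes discrete $P$ and $Q$ and uses the KL weight $\gamma_f(\pi) = 1/(\pi^2(1-\pi))$ obtained from (10) with $f(t) = t\ln t$; the integral is approximated with a plain midpoint rule, and all names are hypothetical.

```python
import math


def v_pi(pi, p, q):
    """Generalised variational divergence V_pi(P,Q) = E_Q[f_pi(dP/dQ)] for discrete P, Q."""
    return min(pi, 1 - pi) - sum(min(pi * px, (1 - pi) * qx) for px, qx in zip(p, q))


def gamma_kl(pi):
    """Weight (10) for f(t) = t ln t, where f''(t) = 1/t."""
    return 1.0 / (pi ** 2 * (1 - pi))


def integral_representation(p, q, gamma, steps=200_000):
    """Midpoint-rule approximation of the integral in (9)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        pi = (k + 0.5) * h
        total += v_pi(pi, p, q) * gamma(pi)
    return h * total


p = [0.7, 0.3]
q = [0.4, 0.6]
kl_direct = sum(px * math.log(px / qx) for px, qx in zip(p, q))
kl_integral = integral_representation(p, q, gamma_kl)
print(kl_direct, kl_integral)
# Relation (11): four times V_{1/2} recovers the variational divergence.
print(4 * v_pi(0.5, p, q), sum(abs(px - qx) for px, qx in zip(p, q)))
```

For distributions with full support, $V_\pi(P,Q)$ vanishes near $\pi = 0$ and $\pi = 1$, which is why the singular weight causes no difficulty for the quadrature here.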

Table 1 summarises the weight functions $\gamma_f$ for a number of
$f$-divergences that appear in the literature. These are used in the proof of
specific bounds in Corollary 1.

Before we can prove the main result, we need to establish some properties
of the general variational divergences. In particular, we will make use of
their relationship to Bayes risks for 0-1 loss.



2.4 Divergence and Risk 

Let $\underline{\mathbb{L}}(\pi,P,Q)$ denote the 0-1 Bayes risk for a
classification problem in which observations are drawn from $\mathcal{X}$
using the mixture distribution $M = \pi P + (1-\pi)Q$, and each observation
$x \in \mathcal{X}$ is assigned a positive label with probability
$\eta(x) := \pi\frac{dP}{dM}(x)$. If $r = r(x) \in \{0,1\}$ is a label
prediction for a particular $x \in \mathcal{X}$, the 0-1 expected loss for
that observation is

$$L(r,\pi,p,q) = (1-\pi)\,q\,[\![r = 1]\!] + \pi\,p\,[\![r = 0]\!],$$

where $q = \frac{dQ}{dM}(x)$ and $p = \frac{dP}{dM}(x)$ are densities. Thus,
the full expected 0-1 loss of a predictor $r : \mathcal{X} \to \{0,1\}$ is
given by $\mathbb{L}(r,\pi,P,Q) := \mathbb{E}_M[L(r(x),\pi,p(x),q(x))]$ and it
is well known (e.g., [S]) that its Bayes risk is obtained by the Bayes optimal
predictor $r^*(x) := [\![\eta(x) \ge \frac12]\!]$. That is,

$$\underline{\mathbb{L}}(\pi,P,Q) := \inf_r \mathbb{L}(r,\pi,P,Q) = \mathbb{L}(r^*,\pi,P,Q), \tag{12}$$

where the infimum is taken over all ($M$-measurable) predictors
$r : \mathcal{X} \to \{0,1\}$. So, by the definition of $\eta(x)$ and noting
that $\eta \ge \frac12$ iff $\pi p \ge \frac12(\pi p + (1-\pi)q)$, which holds
iff $\pi p \ge (1-\pi)q$, we see that the 0-1 Bayes risk can be expressed as

$$\underline{\mathbb{L}}(\pi,P,Q) = (1-\pi)\,\mathbb{E}_Q\big[[\![\pi p \ge (1-\pi)q]\!]\big] + \pi\,\mathbb{E}_P\big[[\![\pi p < (1-\pi)q]\!]\big]. \tag{13}$$

We now observe that

$$q\,f_\pi\!\left(\frac{p}{q}\right) = q\,(\pi \wedge (1-\pi)) - \begin{cases}(1-\pi)q, & q(1-\pi) \le \pi p\\ \pi p, & q(1-\pi) > \pi p\end{cases}$$

and so by noting that
$\mathbb{E}_Q\, f_\pi\!\left(\frac{dP}{dQ}\right) = \mathbb{E}_M\left[q\,f_\pi\!\left(\frac{p}{q}\right)\right]$
we have established the following lemma.

Lemma 3 For all $\pi \in [0,1]$ and all distributions $P$ and $Q$, the
generalised variational divergence satisfies

$$V_\pi(P,Q) = (1-\pi) \wedge \pi - \underline{\mathbb{L}}(\pi,P,Q). \tag{14}$$
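On a finite set the infimum in (12) can be taken by enumerating every predictor, which gives a direct check of Lemma 3. The sketch below assumes discrete $P$ and $Q$ with $q(x) > 0$ and writes the risk with densities taken with respect to counting measure rather than $M$; the function names are hypothetical.

```python
from itertools import product


def expected_loss(r, pi, p, q):
    """0-1 risk of predictor r (a tuple of 0/1 labels): (1-pi) Q(r=1) + pi P(r=0)."""
    return sum((1 - pi) * qx * (rx == 1) + pi * px * (rx == 0)
               for rx, px, qx in zip(r, p, q))


def bayes_risk_brute(pi, p, q):
    """Infimum in (12), found by enumerating every predictor r: X -> {0, 1}."""
    return min(expected_loss(r, pi, p, q) for r in product((0, 1), repeat=len(p)))


def bayes_risk(pi, p, q):
    """Closed form of the Bayes risk: sum_x min(pi p(x), (1-pi) q(x))."""
    return sum(min(pi * px, (1 - pi) * qx) for px, qx in zip(p, q))


def v_pi(pi, p, q):
    """V_pi(P,Q) computed directly as E_Q[f_pi(dP/dQ)]."""
    return sum(qx * (min(pi, 1 - pi) - min(1 - pi, pi * px / qx))
               for px, qx in zip(p, q))


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    risk = bayes_risk(pi, p, q)
    print(pi, bayes_risk_brute(pi, p, q), risk, v_pi(pi, p, q), min(pi, 1 - pi) - risk)
```

The second and third columns agree, as do the last two, which is the content of (12) and (14) respectively.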
Thus, the value of $V_\pi(P,Q)$ can be understood via the 0-1 Bayes risk for a
classification problem with label-conditional distributions $P$ and $Q$ and
prior probability $\pi$ for the positive class. This relationship between
$f$-divergence and Bayes risk is not new. It was established in a more general
setting by Österreicher and Vajda [5D] (who note that the term in (14) is the
statistical information for 0-1 loss) and later by Nguyen et al. [TO].

2.5 Concavity of 0-1 Bayes Risk Curves 

For a given pair of distributions $P$ and $Q$ the set of values for
$\underline{\mathbb{L}}(\pi,P,Q)$ as $\pi$ varies over $[0,1]$ can be
visualised as a curve as in Figure 2.

Lemma 4 For all distributions $P$ and $Q$, the function
$\pi \mapsto \underline{\mathbb{L}}(\pi,P,Q)$ is concave.

Proof By (12) we have that

$$\underline{\mathbb{L}}(\pi,P,Q) = \mathbb{E}_M[L(r^*,\pi,p,q)].$$

Observe that

$$L(r^*,\pi,p,q) = (1-\pi)\,q\,[\![\eta \ge \tfrac12]\!] + \pi\,p\,[\![\eta < \tfrac12]\!]
= \begin{cases}(1-\pi)q, & q(1-\pi) \le \pi p\\ \pi p, & q(1-\pi) > \pi p\end{cases}
= \min\{(1-\pi)q,\ \pi p\}$$

and so for any $p$, $q$ it is the minimum of two linear functions of $\pi$ and
thus concave in $\pi$. The full Bayes risk is the expectation of these
functions and thus simply a linear combination of concave functions and thus
concave. ■
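Lemma 4 can be illustrated numerically on a finite example. The following sketch assumes discrete $P$ and $Q$ and checks midpoint concavity of the risk curve on a grid, together with the envelope $0 \le \underline{\mathbb{L}}(\pi,P,Q) \le \pi \wedge (1-\pi)$; the helper names are hypothetical, and a grid check is of course only evidence, not a proof.

```python
def bayes_risk(pi, p, q):
    """0-1 Bayes risk for discrete P, Q: sum_x min(pi p(x), (1-pi) q(x))."""
    return sum(min(pi * px, (1 - pi) * qx) for px, qx in zip(p, q))


def is_midpoint_concave(g, grid, tol=1e-12):
    """Check g((a+b)/2) >= (g(a)+g(b))/2 for every pair of grid points."""
    return all(g(0.5 * (a + b)) >= 0.5 * (g(a) + g(b)) - tol
               for a in grid for b in grid)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
grid = [k / 50 for k in range(51)]


def risk(pi):
    return bayes_risk(pi, p, q)


print("midpoint concave on grid:", is_midpoint_concave(risk, grid))
print("below pi ^ (1 - pi):", all(risk(x) <= min(x, 1 - x) + 1e-12 for x in grid))
```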



The tightness of the bounds in the main result of the next section depends
on the following corollary of a result due to Torgersen [28]. It asserts that
any appropriate concave function can be viewed as the 0-1 risk curve for some
pair of distributions $P$ and $Q$. A proof can be found in [551 §6.3].

Corollary 5 Suppose $\mathcal{X}$ has a connected component. Let
$\psi : [0,1] \to [0,1]$ be an arbitrary concave function such that for all
$\pi \in [0,1]$, $0 \le \psi(\pi) \le \pi \wedge (1-\pi)$. Then there exist
$P$ and $Q$ such that $\underline{\mathbb{L}}(\pi,P,Q) = \psi(\pi)$ for all
$\pi \in [0,1]$.



3 Main Result 

We will now show how viewing $f$-divergences in terms of their weighted
integral representation simplifies the problem of understanding the
relationship between different divergences and leads, amongst other things, to
an explicit formula for (1).
Fix a positive integer $n$. Consider a sequence
$0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Suppose we "sampled" the value of
$V_\pi(P,Q)$ at these discrete values of $\pi$. Since the risk curve
$\pi \mapsto \underline{\mathbb{L}}(\pi,P,Q) = \pi \wedge (1-\pi) - V_\pi(P,Q)$
is concave (Lemma 4), lines through the sampled points

$$\{(\pi_i, \psi_i)\}_{i=1}^n, \qquad \psi_i := \pi_i \wedge (1-\pi_i) - V_{\pi_i}(P,Q),$$

with suitably constrained slopes lie above the risk curve. Their pointwise
minimum is a piece-wise linear concave upper bound on the risk curve, and
hence a lower bound on the variational curve
$(\pi, V_\pi(P,Q))_{\pi \in (0,1)}$. This therefore gives a lower bound on the
$f$-divergence given by a weight function $\gamma$. This observation forms the
basis of the theorem stated below.



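Theorem 6 For a positive integer $n$ consider a sequence
$0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Let $\pi_0 := 0$ and
$\pi_{n+1} := 1$ and for $i = 0,\dots,n+1$ let

$$\psi_i := (1-\pi_i) \wedge \pi_i - V_{\pi_i}(P,Q)$$

(observe that consequently $\psi_0 = \psi_{n+1} = 0$). Let

$$A_n := \left\{ a = (a_1,\dots,a_n) \in \mathbb{R}^n : \frac{\psi_{i+1}-\psi_i}{\pi_{i+1}-\pi_i} \le a_i \le \frac{\psi_i - \psi_{i-1}}{\pi_i - \pi_{i-1}},\ i = 1,\dots,n \right\}. \tag{15}$$

The set $A_n$ defines the allowable slopes of a piecewise linear function
majorizing $\pi \mapsto \underline{\mathbb{L}}(\pi,P,Q)$ at each of
$\pi_1,\dots,\pi_n$. For $a = (a_1,\dots,a_n) \in A_n$, let

$$\tilde\pi_i := \frac{\psi_i - \psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i}, \tag{16}$$

$$j := \{k \in \{1,\dots,n\} : \tilde\pi_k < \tfrac12 \le \tilde\pi_{k+1}\}, \tag{17}$$

$$\bar\pi_i := [\![i < j]\!]\,\tilde\pi_i + [\![i = j]\!]\,\tfrac12 + [\![j < i]\!]\,\tilde\pi_{i-1}, \tag{18}$$

$$\alpha_{a,i} := [\![i \le j]\!]\,(1 - a_i) + [\![i > j]\!]\,(-1 - a_{i-1}), \tag{19}$$

$$\beta_{a,i} := [\![i \le j]\!]\,(a_i\pi_i - \psi_i) + [\![i > j]\!]\,(1 + a_{i-1}\pi_{i-1} - \psi_{i-1}), \tag{20}$$

for $i = 0,\dots,n+1$ and let $\gamma_f$ be the weight corresponding to $f$
given by (10). For arbitrary $\mathbb{I}_f$ and for all distributions $P$ and
$Q$ the following bound holds. If in addition $\mathcal{X}$ contains a
connected component, it is tight.

$$\mathbb{I}_f(P,Q) \ge \min_{a \in A_n} \sum_{i=0}^{n} \int_{\bar\pi_i}^{\bar\pi_{i+1}} (\alpha_{a,i}\pi + \beta_{a,i})\,\gamma_f(\pi)\,d\pi \tag{22}$$

$$= \min_{a \in A_n} \sum_{i=0}^{n} \Big[(\alpha_{a,i}\bar\pi_{i+1} + \beta_{a,i})\,\Gamma_f(\bar\pi_{i+1}) - \alpha_{a,i}\,\bar\Gamma_f(\bar\pi_{i+1}) - (\alpha_{a,i}\bar\pi_i + \beta_{a,i})\,\Gamma_f(\bar\pi_i) + \alpha_{a,i}\,\bar\Gamma_f(\bar\pi_i)\Big], \tag{23}$$

where $\Gamma_f(\pi) := \int^{\pi} \gamma_f(t)\,dt$ and
$\bar\Gamma_f(\pi) := \int^{\pi} \Gamma_f(t)\,dt$.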
Equation 23 follows from (22) by integration by parts. The remainder of the
proof is in Section 4. Although (23) looks daunting, we observe: (1) the
constraints on $a$ are convex (in fact they are a box constraint); and (2) the
objective is a relatively benign function of $a$.
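The box constraint on the slopes is simple to compute from sampled values. The sketch below assumes a discrete pair $P$, $Q$, obtains $\psi_i$ as the Bayes risk at $\pi_i$ (which equals $\pi_i \wedge (1-\pi_i) - V_{\pi_i}(P,Q)$ by Lemma 3), and returns the interval allowed for each slope $a_i$ by (15); the function names are hypothetical.

```python
def bayes_risk(pi, p, q):
    """0-1 Bayes risk for discrete P, Q: sum_x min(pi p(x), (1-pi) q(x))."""
    return sum(min(pi * px, (1 - pi) * qx) for px, qx in zip(p, q))


def slope_box(pis, psis):
    """Box constraints (lower_i, upper_i) on a_i from (15), for i = 1..n.

    pis and psis hold the interior samples (pi_i, psi_i); the endpoints
    (0, 0) and (1, 0) are appended here.
    """
    xs = [0.0] + list(pis) + [1.0]
    ys = [0.0] + list(psis) + [0.0]
    box = []
    for i in range(1, len(xs) - 1):
        lower = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        upper = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
        box.append((lower, upper))
    return box


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
pis = [0.25, 0.5, 0.75]
psis = [bayes_risk(pi, p, q) for pi in pis]
for pi, (lower, upper) in zip(pis, slope_box(pis, psis)):
    print(f"pi={pi:.2f}: {lower:+.4f} <= a <= {upper:+.4f}")
```

Concavity of the risk curve guarantees that each interval is non-empty. For $n = 1$ and $\pi_1 = \frac12$ the box reduces to $[-2\psi_1, 2\psi_1]$.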

When $n = 1$ the result simplifies considerably. If in addition
$\pi_1 = \frac12$ then by (11) we have $V_{1/2}(P,Q) = \frac14 V(P,Q)$. It is
then a straightforward exercise to explicitly evaluate (22), especially when
$\gamma_f$ is symmetric. The following theorem expresses the result in terms
of $V(P,Q)$ for comparability with previous results. The result for
$\mathrm{KL}(P,Q)$ is a (best-possible) improvement on the classical Pinsker
inequality.

Theorem 7 For any distributions $P$, $Q$ on $\mathcal{X}$, let
$V := V(P,Q)$. Then the following bounds hold and, if in addition
$\mathcal{X}$ has a connected component, are tight.

When $\gamma_f$ is symmetric about $\frac12$ and convex,

$$\mathbb{I}_f(P,Q) \ge 2\left[\bar\Gamma_f\left(\tfrac12 - \tfrac V4\right) + \tfrac V4\,\Gamma_f\left(\tfrac12\right) - \bar\Gamma_f\left(\tfrac12\right)\right] \tag{24}$$

and $\Gamma_f$ and $\bar\Gamma_f$ are as in Theorem 6.

This theorem gives the first explicit representation of the optimal Pinsker
bound.³ By plotting both (2) and (4) one can confirm that the two bounds
(implicit and explicit) coincide; see Figure 1.

4 Proof of Main Result 

Proof (Theorem 6) This proof is driven by the duality between the family
of variational divergences $V_\pi(P,Q)$ and the 0-1 Bayes risk
$\underline{\mathbb{L}}(\pi,P,Q)$ given in Lemma 3. Given distributions $P$
and $Q$ let

$$\phi(\pi) = V_\pi(P,Q) = \pi \wedge (1-\pi) - \psi(\pi),$$

where $\psi(\pi) = \underline{\mathbb{L}}(\pi,P,Q)$. We know that $\psi$ is
non-negative and concave and satisfies $\psi(\pi) \le \pi \wedge (1-\pi)$ and
thus $\psi(0) = \psi(1) = 0$.
Since

$$\mathbb{I}_f(P,Q) = \int_0^1 \phi(\pi)\,\gamma_f(\pi)\,d\pi, \tag{25}$$

$\mathbb{I}_f(P,Q)$ is minimised by minimising $\phi$ over all $(P,Q)$ such
that

$$\phi(\pi_i) = \phi_i = \pi_i \wedge (1-\pi_i) - \psi(\pi_i).$$

³ A summary of existing results and their relationship to those presented here is given in Appendix A.
Figure 1: Lower bound on $\mathrm{KL}(P,Q)$ as a function of the variational
divergence $V(P,Q)$. Both the explicit bound (4) and Fedotov et al.'s implicit
bound (2) are plotted.



Since $\psi_i := (1-\pi_i) \wedge \pi_i - V_{\pi_i}(P,Q) = \psi(\pi_i)$, the
minimisation problem for $\phi$ can be expressed in terms of $\psi$ as:

Given $(\pi_i,\psi_i)_{i=1}^n$ find the maximal $\psi : [0,1] \to [0,\frac12]$ (26)
such that $\psi(\pi_i) = \psi_i$, $i = 0,\dots,n+1$, (27)
$\psi(\pi) \le \pi \wedge (1-\pi)$, $\pi \in [0,1]$, (28)
$\psi$ is concave. (29)

This will tell us the optimal $\phi$ to use since optimising over $\psi$ is
equivalent to optimising over $\underline{\mathbb{L}}(\cdot,P,Q)$. Under the
additional assumption on $\mathcal{X}$, Corollary 5 implies that for any
$\psi$ satisfying (27), (28) and (29) there exists $P$, $Q$ such that
$\underline{\mathbb{L}}(\cdot,P,Q) = \psi(\cdot)$. This establishes the
tightness of our bounds.

Let $\Psi$ be the set of piece-wise linear concave functions on $[0,1]$
having $n+1$ segments such that $\psi \in \Psi \Rightarrow \psi$ satisfies
(27) and (28). We now show that in order to solve (26) it suffices to consider
$\psi \in \Psi$.

If $g$ is a concave function on $\mathbb{R}$, then let

$$\eth g(x) := \{s \in \mathbb{R} : g(y) \le g(x) + \langle s, y - x\rangle,\ y \in \mathbb{R}\}$$

denote the sup-differential of $g$ at $x$. (This is the obvious analogue of
the sub-differential for convex functions [23].) Suppose $\tilde\psi$ is a
general concave function satisfying (27) and (28). For $i = 1,\dots,n$, let

$$G_i^{\tilde\psi} := \{ g_i : [0,1] \to \mathbb{R} \text{ is linear},\ g_i(\pi_i) = \psi_i \text{ and the slope of } g_i \text{ lies in } \eth\tilde\psi(\pi_i) \}.$$

Figure 2: Illustration of construction of optimal
$\psi(\pi) = \underline{\mathbb{L}}(\pi,P,Q)$ in the proof of Theorem 6. The
optimal $\psi$ is piece-wise linear such that $\psi(\pi_i) = \psi_i$,
$i = 0,\dots,n+1$. (Legend: $\pi \wedge (1-\pi)$; admissible region for lines
with slopes in $A_n$; $\psi_a(\pi)$ for a particular $a$.)


Observe that by concavity, for all concave $\tilde\psi$ satisfying (27) and
(28), for all $g \in \bigcup_{i=1}^n G_i^{\tilde\psi}$,
$g(\pi) \ge \tilde\psi(\pi)$, for $\pi \in [0,1]$.

Thus given any such $\tilde\psi$, one can always construct

$$\psi^*(\pi) = \min(g_1(\pi),\dots,g_n(\pi)) \tag{30}$$

such that $\psi^*$ is concave, satisfies (27) and
$\psi^*(\pi) \ge \tilde\psi(\pi)$, for all $\pi \in [0,1]$. It remains to take
account of (28). That is trivially done by setting

$$\psi(\pi) = \min(\psi^*(\pi),\ \pi \wedge (1-\pi)) \tag{31}$$

which remains concave and piecewise linear (although with potentially one
additional linear segment). Finally, the pointwise smallest concave $\psi$
satisfying (27) and (28) is the piecewise linear function connecting the
points $(0,0), (\pi_1,\psi_1), (\pi_2,\psi_2), \dots, (\pi_n,\psi_n), (1,0)$.
Let $g : [0,1] \to [0,\frac12]$ be this function which can be written
explicitly as

$$g(\pi) = \sum_{i=0}^{n} \left(\psi_i + \frac{(\psi_{i+1}-\psi_i)(\pi - \pi_i)}{\pi_{i+1}-\pi_i}\right)\cdot[\![\pi \in [\pi_i,\pi_{i+1}]]\!],$$

where we have defined $\pi_0 := 0$, $\psi_0 := 0$, $\pi_{n+1} := 1$ and
$\psi_{n+1} := 0$.

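We now explicitly parametrize this family of functions. Let
$p_i : [0,1] \to \mathbb{R}$ denote the affine segment the graph of which
passes through $(\pi_i,\psi_i)$, $i = 0,\dots,n+1$. Write
$p_i(\pi) = a_i\pi + b_i$. We know that $p_i(\pi_i) = \psi_i$ and thus

$$b_i = \psi_i - a_i\pi_i, \qquad i = 0,\dots,n+1. \tag{32}$$

In order to determine the constraints on $a_i$, since $g$ is concave and
minorizes $\psi$, it suffices to only consider $(\pi_{i-1}, g(\pi_{i-1}))$ and
$(\pi_{i+1}, g(\pi_{i+1}))$ for $i = 1,\dots,n$. We have (for
$i = 1,\dots,n$)

$$p_i(\pi_{i-1}) \ge g(\pi_{i-1})$$
$$\Rightarrow\ a_i\pi_{i-1} + b_i \ge \psi_{i-1}$$
$$\Rightarrow\ a_i\pi_{i-1} + \psi_i - a_i\pi_i \ge \psi_{i-1}$$
$$\Rightarrow\ a_i\underbrace{(\pi_{i-1} - \pi_i)}_{<0} \ge \psi_{i-1} - \psi_i$$
$$\Rightarrow\ a_i \le \frac{\psi_{i-1}-\psi_i}{\pi_{i-1}-\pi_i}. \tag{33}$$

Similarly we have (for $i = 1,\dots,n$)

$$p_i(\pi_{i+1}) \ge g(\pi_{i+1})$$
$$\Rightarrow\ a_i\pi_{i+1} + \psi_i - a_i\pi_i \ge \psi_{i+1}$$
$$\Rightarrow\ a_i\underbrace{(\pi_{i+1} - \pi_i)}_{>0} \ge \psi_{i+1} - \psi_i$$
$$\Rightarrow\ a_i \ge \frac{\psi_{i+1}-\psi_i}{\pi_{i+1}-\pi_i}. \tag{34}$$

We now determine the points at which $\psi$ defined by (30) and (31) changes
slope. That occurs at the points $\pi$ when

$$p_i(\pi) = p_{i+1}(\pi)$$
$$\Rightarrow\ (a_{i+1} - a_i)\pi = \psi_i - \psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i$$
$$\Rightarrow\ \pi = \frac{\psi_i - \psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i} =: \tilde\pi_i$$

for $i = 0,\dots,n$. Thus

$$\psi(\pi) = p_i(\pi), \qquad \pi \in [\tilde\pi_{i-1},\tilde\pi_i], \quad i = 1,\dots,n.$$

Let $a = (a_1,\dots,a_n)$. We explicitly denote the dependence of $\psi$ on
$a$ by writing $\psi_a$. Let

$$\phi_a(\pi) := \pi \wedge (1-\pi) - \psi_a(\pi) = \alpha_{a,i}\pi + \beta_{a,i}, \qquad \pi \in [\bar\pi_{i-1},\bar\pi_i], \quad i = 1,\dots,n+1,$$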
where $a \in A_n$ (see (15)), and $\bar\pi_i$, $\alpha_{a,i}$ and
$\beta_{a,i}$ are defined by (18), (19) and (20) respectively. The extra
segment induced at index $j$ (see (17)) is needed since
$\pi \mapsto \pi \wedge (1-\pi)$ has a slope change at $\pi = \frac12$. Thus
in general, $\phi_a$ is piece-wise linear with $n+2$ segments (recall $i$
ranges from 0 to $n+2$); if $\tilde\pi_{k+1} = \frac12$ for some
$k \in \{1,\dots,n\}$, then there will be only $n+1$ non-trivial segments.
Thus

$$\left\{\pi \mapsto \sum_{i=0}^{n}\phi_a(\pi)\cdot[\![\pi \in [\bar\pi_i,\bar\pi_{i+1}]]\!] : a \in A_n\right\}$$

is the set of $\phi$ consistent with the constraints and $A_n$ is defined in
(15). Thus substituting into (25), interchanging the order of summation and
integration and optimizing we have shown (22). The tightness has already been
argued: under the additional assumption on $\mathcal{X}$, there is no slop in
the argument above since every $\psi$ satisfying the constraints in (26) is
the Bayes risk function for some $(P,Q)$. ■



Proof (Theorem 7) In this case $n = 1$ and the optimal $\psi$ function will be
piecewise linear, concave, and its graph will pass through $(\pi_1,\psi_1)$.
Thus the optimal $\phi$ will be of the form

$$\phi(\pi) = \begin{cases} 0, & \pi \in [0,L] \cup [U,1]\\ \pi - (a\pi + b), & \pi \in [L,\tfrac12]\\ (1-\pi) - (a\pi + b), & \pi \in [\tfrac12,U] \end{cases}$$

where $a\pi_1 + b = \psi_1 \Rightarrow b = \psi_1 - a\pi_1$ and
$a \in [-2\psi_1, 2\psi_1]$ (see Figure 3). For variational divergence,
$\pi_1 = \frac12$ and thus by (11)

$$\psi_1 = \pi_1 \wedge (1-\pi_1) - \frac V4 = \frac12 - \frac V4$$

and so $\phi_1 = V/4$. We can thus determine $L$ and $U$:

$$aL + b = L \ \Rightarrow\ aL + \psi_1 - a\pi_1 = L \ \Rightarrow\ L = \frac{a\pi_1 - \psi_1}{a - 1}.$$

Similarly $aU + b = 1 - U \Rightarrow U = \frac{1 - \psi_1 + a\pi_1}{a+1}$ and
thus

$$\mathbb{I}_f(P,Q) \ge \min_{a \in [-2\psi_1,2\psi_1]} \int_L^{1/2} [(1-a)\pi - \psi_1 + a\pi_1]\,\gamma_f(\pi)\,d\pi \tag{35}$$

$$\qquad + \int_{1/2}^{U} [(-a-1)\pi - \psi_1 + a\pi_1 + 1]\,\gamma_f(\pi)\,d\pi. \tag{36}$$
Figure 3: The optimisation problem when $n = 1$. Given $\psi_1$, there are
many risk curves consistent with it. The optimisation problem involves finding
the piece-wise linear concave risk curve $\psi \in \Psi$ and the corresponding
$\phi = \pi \wedge (1-\pi) - \psi$ that minimises $\mathbb{I}_f$. $L$ and $U$
are defined in the text.

If $\gamma_f$ is symmetric about $\pi = \frac12$ and convex and
$\pi_1 = \frac12$, then the optimal $a = 0$. Thus in that case,

$$\mathbb{I}_f(P,Q) \ge 2\int_{\psi_1}^{1/2} (\pi - \psi_1)\,\gamma_f(\pi)\,d\pi \tag{37}$$

$$= 2\left[(\tfrac12 - \psi_1)\,\Gamma_f(\tfrac12) + \bar\Gamma_f(\psi_1) - \bar\Gamma_f(\tfrac12)\right]$$

$$= 2\left[\tfrac V4\,\Gamma_f(\tfrac12) + \bar\Gamma_f(\tfrac12 - \tfrac V4) - \bar\Gamma_f(\tfrac12)\right]. \tag{38}$$

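Combining the above with (35) leads to a range of Pinsker style bounds for
symmetric $\mathbb{I}_f$:

Jeffrey's Divergence Since $J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P)$ we
have $\gamma(\pi) = \frac{1}{\pi^2(1-\pi)^2}$. (As a check,
$f(t) = (t-1)\ln(t)$, $f''(t) = \frac{t+1}{t^2}$ and so
$\gamma_f(\pi) = \frac{1}{\pi^3}f''\left(\frac{1-\pi}{\pi}\right) = \frac{1}{\pi^2(1-\pi)^2}$.)
Thus

$$J(P,Q) \ge 2\int_{\psi_1}^{1/2} \frac{\pi - \psi_1}{\pi^2(1-\pi)^2}\,d\pi = (4\psi_1 - 2)(\ln(\psi_1) - \ln(1-\psi_1)).$$

Substituting $\psi_1 = \frac12 - \frac V4$ gives

$$J(P,Q) \ge V\ln\left(\frac{2+V}{2-V}\right).$$

Observe that the above bound behaves like $V^2$ for small $V$, and
$V\ln\left(\frac{2+V}{2-V}\right) \ge V^2$ for $V \in [0,2]$. Using the
traditional Pinsker inequality ($\mathrm{KL}(P,Q) \ge V^2/2$) we have

$$J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P) \ge \frac{V^2}{2} + \frac{V^2}{2} = V^2.$$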

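Jensen-Shannon Divergence Here
$f(t) = \frac t2\ln t - \frac{t+1}{2}\ln(t+1) + \ln 2$ and thus
$\gamma_f(\pi) = \frac{1}{\pi^3}f''\left(\frac{1-\pi}{\pi}\right) = \frac{1}{2\pi(1-\pi)}$.
Thus

$$I(P,Q) \ge 2\int_{\psi_1}^{1/2} \frac{\pi - \psi_1}{2\pi(1-\pi)}\,d\pi = \ln(1-\psi_1) - \psi_1\ln(1-\psi_1) + \psi_1\ln\psi_1 + \ln(2).$$

Substituting $\psi_1 = \frac12 - \frac V4$ leads to

$$I(P,Q) \ge \left(\tfrac12 - \tfrac V4\right)\ln(2 - V) + \left(\tfrac12 + \tfrac V4\right)\ln(2 + V) - \ln(2).$$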

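Hellinger Divergence Here $f(t) = (\sqrt t - 1)^2$. Consequently
$\gamma_f(\pi) = \frac{1}{\pi^3}f''\left(\frac{1-\pi}{\pi}\right) = \frac{1}{\pi^3}\cdot\frac{1}{2((1-\pi)/\pi)^{3/2}} = \frac{1}{2[\pi(1-\pi)]^{3/2}}$
and thus

$$h^2(P,Q) \ge 2\int_{\psi_1}^{1/2} \frac{\pi - \psi_1}{2[\pi(1-\pi)]^{3/2}}\,d\pi = \frac{4\sqrt{\psi_1}(\psi_1 - 1) + 2\sqrt{1-\psi_1}}{\sqrt{1-\psi_1}} = 2 - \sqrt{4 - V^2}.$$

For small $V$, $2 - \sqrt{4 - V^2} \approx V^2/4$.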
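The closed forms obtained so far for the Jeffreys and Hellinger divergences can be compared against a direct numerical evaluation of the symmetric-case integral (37). The following is a minimal sketch using a midpoint rule; the function and variable names are hypothetical, and the weights are the $\gamma_f$ stated in the two paragraphs above.

```python
import math


def symmetric_bound(gamma, v, steps=100_000):
    """Midpoint-rule value of 2 * int_{psi1}^{1/2} (pi - psi1) gamma(pi) dpi, psi1 = 1/2 - v/4."""
    psi1 = 0.5 - v / 4.0
    h = (0.5 - psi1) / steps
    total = 0.0
    for k in range(steps):
        x = psi1 + (k + 0.5) * h
        total += (x - psi1) * gamma(x)
    return 2.0 * h * total


def gamma_jeffreys(x):
    return 1.0 / (x ** 2 * (1 - x) ** 2)


def gamma_hellinger(x):
    return 1.0 / (2.0 * (x * (1 - x)) ** 1.5)


for v in (0.5, 1.0, 1.5):
    print(v,
          symmetric_bound(gamma_jeffreys, v), v * math.log((2 + v) / (2 - v)),
          symmetric_bound(gamma_hellinger, v), 2 - math.sqrt(4 - v * v))
```

The same routine can be pointed at any other symmetric weight to check a candidate closed form before trusting it.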

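Arithmetic-Geometric Mean Divergence In this case,
$f(t) = \frac{t+1}{2}\ln\frac{t+1}{2\sqrt t}$. Thus
$f''(t) = \frac{t^2+1}{4t^2(t+1)}$ and hence
$\gamma_f(\pi) = \frac{1}{\pi^3}f''\left(\frac{1-\pi}{\pi}\right) = \frac{2\pi^2 - 2\pi + 1}{4\pi^2(1-\pi)^2}$
and thus

$$T(P,Q) \ge 2\int_{\psi_1}^{1/2} (\pi - \psi_1)\,\frac{2\pi^2 - 2\pi + 1}{4\pi^2(1-\pi)^2}\,d\pi = -\tfrac12\ln(1-\psi_1) - \tfrac12\ln(\psi_1) - \ln(2).$$

Substituting $\psi_1 = \frac12 - \frac V4$ gives

$$T(P,Q) \ge \ln\left(\frac{4}{\sqrt{4 - V^2}}\right) - \ln(2).$$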

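Symmetric $\chi^2$-Divergence In this case
$\Psi(P,Q) = \chi^2(P,Q) + \chi^2(Q,P)$ and thus (see below)
$\gamma_f(\pi) = \frac{2}{\pi^3} + \frac{2}{(1-\pi)^3}$. (As a check, from
$f(t) = \frac{(t-1)^2(t+1)}{t}$ we have $f''(t) = \frac{2(t^3+1)}{t^3}$ and
thus $\gamma_f(\pi) = \frac{1}{\pi^3}f''\left(\frac{1-\pi}{\pi}\right)$ gives
the same result.)

$$\Psi(P,Q) \ge 2\int_{\psi_1}^{1/2} (\pi - \psi_1)\left(\frac{2}{\pi^3} + \frac{2}{(1-\pi)^3}\right)d\pi = \frac{2(1 + 4\psi_1^2 - 4\psi_1)}{\psi_1(1 - \psi_1)}.$$

Substituting $\psi_1 = \frac12 - \frac V4$ gives
$\Psi(P,Q) \ge \frac{8V^2}{4 - V^2}$.

When $\gamma_f$ is not symmetric, one needs to use (36) instead of the simpler
(38). We consider two cases.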

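$\chi^2$-Divergence Here $f(t) = (t-1)^2$ and so $f''(t) = 2$ and hence
$\gamma(\pi) = f''\left(\frac{1-\pi}{\pi}\right)/\pi^3 = \frac{2}{\pi^3}$
which is not symmetric. Upon substituting $2/\pi^3$ for $\gamma(\pi)$ in (36)
and evaluating the integrals we obtain

$$\chi^2(P,Q) \ge 2\min_{a \in [-2\psi_1,2\psi_1]} \underbrace{\frac{-2\,(1 + 4\psi_1^2 - 4\psi_1)}{(2\psi_1 - a)(2\psi_1 - a - 2)}}_{=:J(a,\psi_1)}.$$

One can then solve $\frac{\partial}{\partial a}J(a,\psi_1) = 0$ for $a$ and
one obtains $a^* = 2\psi_1 - 1$. Now $a^* \ge -2\psi_1$ only if
$\psi_1 \ge \frac14$. One can check that when $\psi_1 < \frac14$, then
$a \mapsto J(a,\psi_1)$ is monotonically increasing for
$a \in [-2\psi_1, 2\psi_1]$ and hence the minimum occurs at
$a^* = -2\psi_1$. Thus the value of $a$ minimising $J(a,\psi_1)$ is

$$a^* = [\![\psi_1 > 1/4]\!]\,(2\psi_1 - 1) + [\![\psi_1 \le 1/4]\!]\,(-2\psi_1).$$

Substituting the optimal value of $a^*$ into $J(a,\psi_1)$ we obtain

$$J(a^*,\psi_1) = [\![\psi_1 > 1/4]\!]\,(2 + 8\psi_1^2 - 8\psi_1) + [\![\psi_1 \le 1/4]\!]\,\frac{1 - 2\psi_1}{4\psi_1}.$$

Substituting $\psi_1 = \frac12 - \frac V4$ and observing that
$V < 1 \Leftrightarrow \psi_1 > 1/4$ we obtain

$$\chi^2(P,Q) \ge [\![V < 1]\!]\,V^2 + [\![V \ge 1]\!]\,\frac{V}{2 - V}.$$

Observe that the bound diverges to $\infty$ as $V \to 2$.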
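The $\chi^2$ bound can be cross-checked by evaluating (35) and (36) numerically with $\gamma(\pi) = 2/\pi^3$ and minimising over $a$ on a grid. This is a rough sketch under those assumptions (midpoint quadrature, a uniform grid over $a$, $\pi_1 = \frac12$); the function names are hypothetical.

```python
def midpoint(g, lo, hi, steps):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (k + 0.5) * h) for k in range(steps))


def n1_bound(gamma, v, a_steps=200, int_steps=1000):
    """Grid minimisation over a of (35) + (36), with pi_1 = 1/2 and psi_1 = 1/2 - v/4."""
    psi1 = 0.5 - v / 4.0
    best = float("inf")
    # The right endpoint a = 2 psi_1 is skipped: there L = 0 and, for weights
    # that are singular at 0, the first integral diverges.
    for k in range(a_steps):
        a = -2.0 * psi1 + 4.0 * psi1 * k / a_steps
        b = psi1 - a / 2.0              # intercept of the line a*pi + b
        lower = b / (1.0 - a)           # L in the text
        upper = (1.0 - b) / (1.0 + a)   # U in the text
        left = midpoint(lambda x: ((1.0 - a) * x - b) * gamma(x), lower, 0.5, int_steps)
        right = midpoint(lambda x: ((1.0 - b) - (1.0 + a) * x) * gamma(x), 0.5, upper, int_steps)
        best = min(best, left + right)
    return best


def gamma_chi2(x):
    return 2.0 / x ** 3


def chi2_closed_form(v):
    return v * v if v < 1 else v / (2.0 - v)


for v in (0.5, 1.5):
    print(v, n1_bound(gamma_chi2, v), chi2_closed_form(v))
```

A grid search slightly overestimates the true minimum when the minimiser falls between grid points, so agreement is only expected up to the grid resolution.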
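Kullback-Leibler Divergence In this case we have $f(t) = t\ln t$ and thus
$f''(t) = 1/t$ and consequently
$\gamma_f(\pi) = \frac{1}{\pi^3}f''\left(\frac{1-\pi}{\pi}\right) = \frac{1}{\pi^2(1-\pi)}$
which is clearly not symmetric. From (36) we obtain

$$\mathrm{KL}(P,Q) \ge \min_{a \in [-2\psi_1,2\psi_1]} \left(1 - \frac a2 - \psi_1\right)\ln\left(\frac{a + 2\psi_1 - 2}{a - 2\psi_1}\right) + \left(\frac a2 + \psi_1\right)\ln\left(\frac{a + 2\psi_1}{a - 2\psi_1 + 2}\right).$$

Substituting $\psi_1 = \frac12 - \frac V4$ gives

$$\mathrm{KL}(P,Q) \ge \min_{a \in \left[\frac{V-2}{2},\frac{2-V}{2}\right]} \delta_a(V),$$

where

$$\delta_a(V) = \left(\frac{V + 2 - 2a}{4}\right)\ln\left(\frac{2a - 2 - V}{2a - 2 + V}\right) + \left(\frac{2a + 2 - V}{4}\right)\ln\left(\frac{2a + 2 - V}{2a + 2 + V}\right).$$

Set $\beta := 2a$ and we have (4).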
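The explicit bound (4) involves only a one dimensional minimisation over $\beta$, so it is straightforward to evaluate and to compare with the parametric form (2). The sketch below is one way to do that, assuming a plain grid search over the open interval $(V-2, 2-V)$; the function names are hypothetical.

```python
import math


def kl_explicit_bound(v, steps=20_000):
    """Grid minimisation of the objective in (4) over the open interval (v - 2, 2 - v)."""
    best = float("inf")
    for k in range(1, steps):
        beta = (v - 2.0) + 2.0 * (2.0 - v) * k / steps
        first = (v + 2 - beta) / 4 * math.log((beta - 2 - v) / (beta - 2 + v))
        second = (beta + 2 - v) / 4 * math.log((beta + 2 - v) / (beta + 2 + v))
        best = min(best, first + second)
    return best


def fedotov_point(t):
    """(V(t), L(t)) from the parametric form (2), for t > 0."""
    coth = math.cosh(t) / math.sinh(t)
    v = t * (1.0 - (coth - 1.0 / t) ** 2)
    bound = math.log(t / math.sinh(t)) + t * coth - (t / math.sinh(t)) ** 2
    return v, bound


for t in (0.5, 1.0, 2.0):
    v, implicit = fedotov_point(t)
    print(f"V={v:.4f}  explicit={kl_explicit_bound(v):.6f}  "
          f"implicit={implicit:.6f}  Pinsker={v * v / 2:.6f}")
```

The endpoints of the interval are excluded because the logarithms in (4) degenerate there; a finer grid or a scalar minimiser could replace the grid search if more accuracy is needed.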



5 Conclusion 

We have generalised the classical Pinsker inequality and developed best
possible bounds for the general situation. A special case of the result gives
an explicit bound relating Kullback-Leibler divergence and variational
divergence. The proof relied on an integral representation of $f$-divergences
in terms of statistical information. Such representations are a powerful
device as they identify the primitives underpinning general learning problems.
These representations are further studied in [25].

A History of Pinsker Inequalities 

Pinsker [3T] presented the first bound relating $\mathrm{KL}(P,Q)$ to
$V(P,Q)$: $\mathrm{KL} \ge V^2/2$ and it is now known by his name or sometimes
as the Pinsker-Csiszár-Kullback inequality since Csiszár [3] presented another
version and Kullback [14] showed $\mathrm{KL} \ge V^2/2 + V^4/36$. Much later
Topsøe [23] showed $\mathrm{KL} \ge V^2/2 + V^4/36 + V^6/270$. Non-polynomial
bounds are due to Vajda [31]:
$\mathrm{KL} \ge L_{\mathrm{Vajda}}(V) := \log\left(\frac{2+V}{2-V}\right) - \frac{2V}{2+V}$
and Toussaint [39] who showed
$\mathrm{KL} \ge L_{\mathrm{Vajda}}(V) \vee (V^2/2 + V^4/36 + V^6/288)$.

Care needs to be taken when comparing results from the literature as
different definitions for the divergences exist. For example Gibbs and Su [8]
used a definition of $V$ that differs by a factor of 2 from ours. There are
some isolated bounds relating $V$ to some other divergences, analogous to the
classical Pinsker bound; Kumar [15] has presented a summary as well as new
bounds for a wide range of symmetric $f$-divergences by making assumptions on
the likelihood ratio: $r \le p(x)/q(x) \le R < \infty$ for all
$x \in \mathcal{X}$. This line of reasoning has also been developed by
Dragomir et al. [B] and Taneja [25J[21]. Topsøe [57] has presented some
infinite series representations for capacitory discrimination in terms of
triangular discrimination which lead to inequalities between those two
divergences. Liese and Miescke [18, p. 48] give the inequality
$V \le h\sqrt{4 - h^2}$ (which seems to be originally due to LeCam [16]) which
when rearranged corresponds exactly to the bound for $h^2$ in Theorem 7.
Withers [35] has also presented some inequalities between other (particular)
pairs of divergences; his reasoning is also in terms of infinite series
expansions.

Arnold et al. [30] considered the case of $n = 1$ but arbitrary
$\mathbb{I}_f$ (that is, they bound an arbitrary $f$-divergence in terms of
the variational divergence). Their argument is similar to the geometric proof
of Theorem 6. They do not compute any of the explicit bounds in Theorem 7
except that they state (page 243) a lower bound on $\chi^2(P,Q)$ which is
looser than (3).

Gilardoni [9] showed (via an intricate argument) that if $f'''(1)$ exists,
then $\mathbb{I}_f \ge \frac{f''(1)}{2}V^2$. He also showed some fourth order
inequalities of the form $\mathbb{I}_f \ge c_{2,f}V^2 + c_{4,f}V^4$ where the
constants depend on the behaviour of $f$ at 1 in a complex way. Gilardoni
[10, 11] presented a completely different approach which obtains many of the
results of Theorem 7.* Gilardoni [11] improved Vajda's bound slightly to
$\mathrm{KL}(P,Q) \ge \ln\frac{2}{2-V} - \frac{2-V}{2}\ln\frac{2+V}{2}$.

Gilardoni [10, 11] presented a general tight lower bound for
$\mathbb{I}_f = \mathbb{I}_f(P,Q)$ in terms of $V = V(P,Q)$ which is difficult
to evaluate explicitly in general:

$$\mathbb{I}_f \ge \frac V2\left(\frac{f[g_R^{-1}(k(1/V))]}{g_R^{-1}(k(1/V)) - 1} + \frac{f[g_L^{-1}(k(1/V))]}{1 - g_L^{-1}(k(1/V))}\right),$$

where
$k^{-1}(t) = \frac12\left(\frac{1}{1 - g_L^{-1}(t)} + \frac{1}{g_R^{-1}(t) - 1}\right)$
and of course $k(u) = (k^{-1})^{-1}(u)$; and $g(u) = (u-1)f'(u) - f(u)$,
$g_R^{-1}[g(u)] = u$ for $u \ge 1$ and $g_L^{-1}[g(u)] = u$ for $u \le 1$. He
presented a new parametric form for $\mathbb{I}_f = \mathrm{KL}$ in terms of
Lambert's $W$ function. In general, the result is analogous to that of Fedotov
et al. [7] in that it is in a parametric form which, if one wishes to evaluate
for a particular $V$, requires a one dimensional numerical search, which is as
complex as (4). However, when $f$ is such that $\mathbb{I}_f$ is symmetric,
this simplifies to the elegant form
$\mathbb{I}_f \ge \frac{2-V}{2}f\left(\frac{2+V}{2-V}\right)$. He presented
explicit special cases for $h^2$, $J$, $\Delta$ and $I$ identical to the
results in Theorem 7. It is not apparent how the approach of Gilardoni
[10, 11] could be extended to more general situations such as that in
Theorem 6 (i.e. $n > 1$).

Bolley and Villani [2] considered weighted versions of the Pinsker inequal- 
ities (for a weighted generalisation of Variational divergence) in terms of KL- 
divergence that are related to transportation inequalities. 



*We were unaware of these two papers until completing the results presented in the main 
paper. 